The Behavioral Risk Factor Surveillance System (BRFSS) is one of the most important continuous surveillance programs in the United States, gathering state-specific information on health-related risk behaviors, chronic health conditions, and the use of preventive services. By delivering consistent, trustworthy data, the BRFSS supports the creation of informed public health policies and programs aimed at improving the health and well-being of the American people. This project takes up the challenge posed by the BRFSS dataset: developing predictive models and actionable insights for public health. The goal is to use machine learning to improve our understanding of how demographics, lifestyle factors, and chronic diseases interact, with a special focus on depression, diabetes, and heart disease. Through exploratory data analysis of the large BRFSS dataset, this study constructs predictive models for general health status, taking into account crucial factors such as age and body mass index. The ultimate aim is to support informed decision-making and facilitate focused interventions for chronic disease prevention.
The first step in the analysis is fundamental data exploration, which includes importing libraries and reviewing the dataset's contents, structure, and descriptive statistics. A thorough check for missing values lays the basis for the in-depth analysis and machine learning applications that follow.
Importing necessary libraries
# For computations using data frames and mathematics
import numpy as np
import pandas as pd
# For Visualisation
import seaborn as sns
from scipy import stats
import plotly.express as px
import matplotlib.pyplot as plt
# For models, model selection, and evaluation metrics
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
df = pd.read_csv('CVD_cleaned.csv')
df.head(5)
| General_Health | Checkup | Exercise | Heart_Disease | Skin_Cancer | Other_Cancer | Depression | Diabetes | Arthritis | Sex | Age_Category | Height_(cm) | Weight_(kg) | BMI | Smoking_History | Alcohol_Consumption | Fruit_Consumption | Green_Vegetables_Consumption | FriedPotato_Consumption | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Poor | Within the past 2 years | No | No | No | No | No | No | Yes | Female | 70-74 | 150.0 | 32.66 | 14.54 | Yes | 0.0 | 30.0 | 16.0 | 12.0 |
| 1 | Very Good | Within the past year | No | Yes | No | No | No | Yes | No | Female | 70-74 | 165.0 | 77.11 | 28.29 | No | 0.0 | 30.0 | 0.0 | 4.0 |
| 2 | Very Good | Within the past year | Yes | No | No | No | No | Yes | No | Female | 60-64 | 163.0 | 88.45 | 33.47 | No | 4.0 | 12.0 | 3.0 | 16.0 |
| 3 | Poor | Within the past year | Yes | Yes | No | No | No | Yes | No | Male | 75-79 | 180.0 | 93.44 | 28.73 | No | 0.0 | 30.0 | 30.0 | 8.0 |
| 4 | Good | Within the past year | No | No | No | No | No | No | No | Male | 80+ | 191.0 | 88.45 | 24.37 | Yes | 0.0 | 8.0 | 4.0 | 0.0 |
To begin the analysis, an initial review of the dataset's structural features gives a complete picture. With 308,854 rows and 19 distinct features, the dataset offers an extensive amount of health-related data.
df.shape
(308854, 19)
A closer inspection of the dataset is carried out to gain a more detailed understanding. This includes extracting essential details such as null counts, dataset shape, column names, and the data type of each column. This summary is the first step toward understanding the structure and basic characteristics of the dataset.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 308854 entries, 0 to 308853
Data columns (total 19 columns):
 #   Column                        Non-Null Count   Dtype
---  ------                        --------------   -----
 0   General_Health                308854 non-null  object
 1   Checkup                       308854 non-null  object
 2   Exercise                      308854 non-null  object
 3   Heart_Disease                 308854 non-null  object
 4   Skin_Cancer                   308854 non-null  object
 5   Other_Cancer                  308854 non-null  object
 6   Depression                    308854 non-null  object
 7   Diabetes                      308854 non-null  object
 8   Arthritis                     308854 non-null  object
 9   Sex                           308854 non-null  object
 10  Age_Category                  308854 non-null  object
 11  Height_(cm)                   308854 non-null  float64
 12  Weight_(kg)                   308854 non-null  float64
 13  BMI                           308854 non-null  float64
 14  Smoking_History               308854 non-null  object
 15  Alcohol_Consumption           308854 non-null  float64
 16  Fruit_Consumption             308854 non-null  float64
 17  Green_Vegetables_Consumption  308854 non-null  float64
 18  FriedPotato_Consumption       308854 non-null  float64
dtypes: float64(7), object(12)
memory usage: 44.8+ MB
Descriptive statistics are examined to probe the dataset further. This involves extracting the count, mean, minimum, maximum, standard deviation, and quartiles for each numeric column, all essential for statistical analysis. Extracting these statistics gives a deeper understanding of the dataset's numerical features.
df.describe()
| Height_(cm) | Weight_(kg) | BMI | Alcohol_Consumption | Fruit_Consumption | Green_Vegetables_Consumption | FriedPotato_Consumption | |
|---|---|---|---|---|---|---|---|
| count | 308854.000000 | 308854.000000 | 308854.000000 | 308854.000000 | 308854.000000 | 308854.000000 | 308854.000000 |
| mean | 170.615249 | 83.588655 | 28.626211 | 5.096366 | 29.835200 | 15.110441 | 6.296616 |
| std | 10.658026 | 21.343210 | 6.522323 | 8.199763 | 24.875735 | 14.926238 | 8.582954 |
| min | 91.000000 | 24.950000 | 12.020000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 163.000000 | 68.040000 | 24.210000 | 0.000000 | 12.000000 | 4.000000 | 2.000000 |
| 50% | 170.000000 | 81.650000 | 27.440000 | 1.000000 | 30.000000 | 12.000000 | 4.000000 |
| 75% | 178.000000 | 95.250000 | 31.850000 | 6.000000 | 30.000000 | 20.000000 | 8.000000 |
| max | 241.000000 | 293.020000 | 99.330000 | 30.000000 | 120.000000 | 128.000000 | 128.000000 |
To ensure the dataset's consistency and reliability, a thorough evaluation of each column's data type is carried out. This is a crucial step to assure coherence across the dataset, providing a strong basis for further analytical work.
df.dtypes
General_Health                   object
Checkup                          object
Exercise                         object
Heart_Disease                    object
Skin_Cancer                      object
Other_Cancer                     object
Depression                       object
Diabetes                         object
Arthritis                        object
Sex                              object
Age_Category                     object
Height_(cm)                     float64
Weight_(kg)                     float64
BMI                             float64
Smoking_History                  object
Alcohol_Consumption             float64
Fruit_Consumption               float64
Green_Vegetables_Consumption    float64
FriedPotato_Consumption         float64
dtype: object
An examination of null values is conducted as part of an organized data quality check. The check finds no missing values in any of the dataset's columns. This cleanliness gives confidence in the accuracy of the subsequent analyses.
df.isnull().sum()
General_Health                  0
Checkup                         0
Exercise                        0
Heart_Disease                   0
Skin_Cancer                     0
Other_Cancer                    0
Depression                      0
Diabetes                        0
Arthritis                       0
Sex                             0
Age_Category                    0
Height_(cm)                     0
Weight_(kg)                     0
BMI                             0
Smoking_History                 0
Alcohol_Consumption             0
Fruit_Consumption               0
Green_Vegetables_Consumption    0
FriedPotato_Consumption         0
dtype: int64
Next, the column data types are converted explicitly. This conversion ensures consistency and compliance with the intended schema, creating an organized framework for smooth data preprocessing.
df = df.astype({
'General_Health': 'string',
'Exercise': 'string',
'Heart_Disease': 'string',
'Skin_Cancer': 'string',
'Other_Cancer': 'string',
'Depression': 'string',
'Diabetes': 'string',
'Arthritis': 'string',
'Sex': 'string',
'Smoking_History': 'string',
'Checkup': 'object',
'Age_Category': 'string',
'Height_(cm)': 'int64',
'Weight_(kg)': 'float64',
'BMI': 'float64',
'Alcohol_Consumption': 'int64',
'Fruit_Consumption': 'int64',
'Green_Vegetables_Consumption': 'int64',
'FriedPotato_Consumption': 'int64'
})
print(df.dtypes)
General_Health                   string
Checkup                          object
Exercise                         string
Heart_Disease                    string
Skin_Cancer                      string
Other_Cancer                     string
Depression                       string
Diabetes                         string
Arthritis                        string
Sex                              string
Age_Category                     string
Height_(cm)                       int64
Weight_(kg)                     float64
BMI                             float64
Smoking_History                  string
Alcohol_Consumption               int64
Fruit_Consumption                 int64
Green_Vegetables_Consumption      int64
FriedPotato_Consumption           int64
dtype: object
To improve data integrity and support robust analysis, we introduce 'remove_outliers'. The function takes a DataFrame (df) and an optional z-score threshold (z_threshold), computes z-scores for the numerical columns, and drops any row in which a z-score exceeds the threshold, reducing the influence of outliers. After outlier removal, Label Encoding converts the categorical variables to integer codes, improving consistency and computational efficiency. Applying 'remove_outliers' to the DataFrame df and then encoding the categorical columns yields a refined dataset ready for further analysis, in line with common data preparation practice.
def remove_outliers(df, z_threshold=3):
    # Z-scores for the numeric columns; rows with any |z| >= threshold are dropped
    z_scores = np.abs(stats.zscore(df.select_dtypes(include=['int64', 'float64'])))
    # .copy() avoids a SettingWithCopyWarning when the result is modified later
    df_no_outliers = df[(z_scores < z_threshold).all(axis=1)].copy()
    return df_no_outliers
df = remove_outliers(df)
# Encoding Categorical Variables
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
categorical_columns = df.select_dtypes(include=['string', 'object']).columns
for column in categorical_columns:
df[column] = label_encoder.fit_transform(df[column])
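One caveat with the loop above: a single LabelEncoder is re-fitted on every column, so the fitted mapping for earlier columns is overwritten and cannot be reversed later. A common variant (a sketch, not part of the original notebook; the `demo` frame is a hypothetical stand-in for the BRFSS columns) keeps one encoder per column so the integer codes can be decoded back to labels:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical mini-frame standing in for two BRFSS categorical columns
demo = pd.DataFrame({
    'Sex': ['Female', 'Male', 'Female'],
    'Smoking_History': ['Yes', 'No', 'No'],
})

encoders = {}  # one fitted encoder per column
for column in demo.columns:
    encoders[column] = LabelEncoder()
    demo[column] = encoders[column].fit_transform(demo[column])

# The per-column encoders allow the codes to be reversed later
decoded_sex = encoders['Sex'].inverse_transform(demo['Sex'])
print(list(decoded_sex))  # ['Female', 'Male', 'Female']
```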
'extract_age' is a utility function intended to standardize the dataset's age categories. Given an age category string, it handles range values such as '18-24' or '25-34' by returning the midpoint of the range, maps the open-ended '80+' bracket to a fixed value of 80, and converts a bare value such as '65' directly to an integer. Note that because 'Age_Category' was label-encoded in the previous step, the code below copies the encoded category codes into a new 'Age' column rather than applying 'extract_age' to the original strings; to obtain midpoint ages, 'extract_age' would need to be applied before encoding.
# Function to extract the age
def extract_age(age_category):
if '+' in age_category:
# Handling '80+' as a specific case
return 80
else:
# Splitting the range and calculating the average
age_range = age_category.split('-')
if '-' in age_category:
return (int(age_range[0]) + int(age_range[1])) / 2
else:
return int(age_category)
# Creating the 'Age' column. 'Age_Category' was already label-encoded above,
# so extract_age cannot be applied to the integer codes; the codes are copied
# as-is. To use midpoint ages instead, apply extract_age to the original
# strings before encoding: df['Age'] = df['Age_Category'].apply(extract_age)
df['Age'] = df['Age_Category']
df
| General_Health | Checkup | Exercise | Heart_Disease | Skin_Cancer | Other_Cancer | Depression | Diabetes | Arthritis | Sex | Age_Category | Height_(cm) | Weight_(kg) | BMI | Smoking_History | Alcohol_Consumption | Fruit_Consumption | Green_Vegetables_Consumption | FriedPotato_Consumption | Age | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 10 | 150 | 32.66 | 14.54 | 1 | 0 | 30 | 16 | 12 | 10 |
| 1 | 4 | 4 | 0 | 1 | 0 | 0 | 0 | 2 | 0 | 0 | 10 | 165 | 77.11 | 28.29 | 0 | 0 | 30 | 0 | 4 | 10 |
| 2 | 4 | 4 | 1 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 8 | 163 | 88.45 | 33.47 | 0 | 4 | 12 | 3 | 16 | 8 |
| 3 | 3 | 4 | 1 | 1 | 0 | 0 | 0 | 2 | 0 | 1 | 11 | 180 | 93.44 | 28.73 | 0 | 0 | 30 | 30 | 8 | 11 |
| 4 | 2 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 12 | 191 | 88.45 | 24.37 | 1 | 0 | 8 | 4 | 0 | 12 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 308848 | 2 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 7 | 168 | 58.97 | 20.98 | 0 | 0 | 16 | 12 | 0 | 7 |
| 308849 | 4 | 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 168 | 81.65 | 29.05 | 0 | 4 | 30 | 8 | 0 | 1 |
| 308851 | 4 | 0 | 1 | 0 | 0 | 0 | 1 | 3 | 0 | 0 | 2 | 157 | 61.23 | 24.69 | 1 | 4 | 40 | 8 | 4 | 2 |
| 308852 | 4 | 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 9 | 183 | 79.38 | 23.73 | 0 | 3 | 30 | 12 | 0 | 9 |
| 308853 | 0 | 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 160 | 81.19 | 31.71 | 0 | 1 | 5 | 12 | 1 | 5 |
276089 rows × 20 columns
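The midpoint logic of extract_age can be sanity-checked in isolation (the function is repeated here so the snippet is self-contained):

```python
def extract_age(age_category):
    # '80+' maps to a fixed 80; 'a-b' maps to the midpoint of the range;
    # a bare number such as '65' maps to itself as an integer
    if '+' in age_category:
        return 80
    if '-' in age_category:
        low, high = age_category.split('-')
        return (int(low) + int(high)) / 2
    return int(age_category)

print(extract_age('18-24'))  # 21.0 (midpoint of the range)
print(extract_age('80+'))    # 80  (open-ended top bracket)
```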
A crucial preprocessing step recomputes the Body Mass Index (BMI) from the 'Weight_(kg)' and 'Height_(cm)' columns. The heights must first be converted from centimeters to meters by dividing by 100; ((df['Height_(cm)'] / 100) ** 2) then gives the square of the height in meters. Dividing the weight in kilograms by the squared height in meters yields the BMI for each record, providing a precise and uniform metric for evaluating body mass across the dataset.
# Simplified BMI calculation
df['BMI'] = df['Weight_(kg)'] / ((df['Height_(cm)'] / 100) ** 2)
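As a quick check of the formula, here is the calculation for a single hypothetical record (weight 77.11 kg at height 165 cm, matching one of the rows shown earlier); the recomputed value can differ slightly from the BMI originally reported in the survey:

```python
# BMI = weight (kg) / height (m) squared
weight_kg = 77.11
height_m = 165 / 100           # centimeters to meters
bmi = weight_kg / height_m ** 2
print(round(bmi, 2))           # 28.32
```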
To represent the distribution of heart disease in the dataset (df), the Plotly Express library is used to create a histogram. The plot shows the incidence of heart disease, where a value of 1 indicates the condition's presence and 0 its absence. The resulting figure, titled "Distribution of Heart Disease", shows the frequency of each 'Heart_Disease' class on the x-axis. This clear, informative summary is a useful tool for gauging the prevalence of heart disease within the dataset.
Output:
When the output is examined, a low incidence of heart disease cases is found, supporting the claim that a small percentage of the dataset has this medical condition.
fig = px.histogram(
df,
x='Heart_Disease',
color='Heart_Disease',
labels={'Heart_Disease': 'Heart Disease'},
title='Distribution of Heart Disease',
)
fig.update_layout(
xaxis_title='Heart Disease',
yaxis_title='Count',
)
fig.show()
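The imbalance visible in the plot can also be quantified numerically. A minimal sketch with hypothetical stand-in labels (in the notebook, this would be `df['Heart_Disease'].value_counts(normalize=True)`):

```python
import pandas as pd

# Hypothetical stand-in labels: 92 negatives, 8 positives
labels = pd.Series([0] * 92 + [1] * 8, name='Heart_Disease')
shares = labels.value_counts(normalize=True)
print(shares)  # class 0: 0.92, class 1: 0.08
```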
Correlation analyses examining the 'Heart_Disease' column and other health-related factors in the dataset show complex relationships. Important details about the dataset may be found in the correlation matrix, which uses correlation coefficients ranging from -1 to 1 to express these associations. Key findings include:
1. Heart Disease and Diabetes (0.168): A moderate positive correlation indicates a possible link between diabetes and heart disease, suggesting an increased risk of heart-related issues in people with diabetes.
2. Heart Disease and Age (0.233): This positive relationship highlights the fact that the risk of developing heart disease increases with age, revealing an age-dependent vulnerability.
3. Skin cancer and other cancers (0.150): The positive association suggests that there may be common risk factors or underlying processes that contribute to the occurrence of several types of cancer in addition to skin cancer.
4. Exercise and Diabetes (-0.135): This shows that regular physical activity may have a preventive impact on the development of diabetes, which is consistent with accepted health principles.
5. BMI and Weight (0.844): The strong positive correlation between BMI and weight highlights the relationship that exists between the two variables, indicating the role that each plays in defining an individual's entire body composition.
6. Exercise and Alcohol Consumption (0.118): The weak positive association between exercise and alcohol consumption points to a concurrent pattern, suggesting that people who exercise may also report drinking.
These findings provide important insights for focused interventions and public health policies in addition to adding to a deeper knowledge of health connections.
df.corr()
| General_Health | Checkup | Exercise | Heart_Disease | Skin_Cancer | Other_Cancer | Depression | Diabetes | Arthritis | Sex | Age_Category | Height_(cm) | Weight_(kg) | BMI | Smoking_History | Alcohol_Consumption | Fruit_Consumption | Green_Vegetables_Consumption | FriedPotato_Consumption | Age | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| General_Health | 1.000000 | 0.026057 | 0.037862 | -0.022939 | 0.018898 | 0.002223 | 0.001567 | -0.026001 | 0.011273 | -0.015768 | 0.029396 | 0.000088 | 0.023168 | 0.024794 | 0.001689 | 0.027691 | -0.004715 | -0.009803 | 0.002836 | 0.029396 |
| Checkup | 0.026057 | 1.000000 | -0.031425 | 0.085412 | 0.079136 | 0.086693 | 0.034371 | 0.128659 | 0.151135 | -0.099826 | 0.225873 | -0.089177 | 0.011150 | 0.065015 | -0.006792 | -0.045097 | 0.040962 | 0.049745 | -0.066167 | 0.225873 |
| Exercise | 0.037862 | -0.031425 | 1.000000 | -0.097616 | -0.008178 | -0.056206 | -0.081230 | -0.135287 | -0.124046 | 0.063906 | -0.128877 | 0.097187 | -0.067275 | -0.140239 | -0.094206 | 0.118789 | 0.137490 | 0.146838 | -0.033929 | -0.128877 |
| Heart_Disease | -0.022939 | 0.085412 | -0.097616 | 1.000000 | 0.093381 | 0.093845 | 0.030819 | 0.168441 | 0.155523 | 0.071980 | 0.233017 | 0.015581 | 0.048619 | 0.046389 | 0.109822 | -0.051777 | -0.019420 | -0.020643 | -0.012972 | 0.233017 |
| Skin_Cancer | 0.018898 | 0.079136 | -0.008178 | 0.093381 | 1.000000 | 0.150536 | -0.011644 | 0.038111 | 0.137625 | 0.007274 | 0.270318 | 0.005132 | -0.020923 | -0.028584 | 0.031116 | 0.019914 | 0.028005 | 0.030356 | -0.042705 | 0.270318 |
| Other_Cancer | 0.002223 | 0.086693 | -0.056206 | 0.093845 | 0.150536 | 1.000000 | 0.014678 | 0.066042 | 0.129958 | -0.042369 | 0.235955 | -0.044302 | -0.019371 | 0.005226 | 0.053710 | -0.023363 | 0.010677 | 0.004640 | -0.040664 | 0.235955 |
| Depression | 0.001567 | 0.034371 | -0.081230 | 0.030819 | -0.011644 | 0.014678 | 1.000000 | 0.048162 | 0.120713 | -0.141150 | -0.100843 | -0.093349 | 0.028763 | 0.095232 | 0.102184 | -0.024138 | -0.040335 | -0.062374 | 0.019190 | -0.100843 |
| Diabetes | -0.026001 | 0.128659 | -0.135287 | 0.168441 | 0.038111 | 0.066042 | 0.048162 | 1.000000 | 0.134115 | -0.011600 | 0.202778 | -0.042289 | 0.147966 | 0.199774 | 0.058165 | -0.116838 | -0.017819 | -0.033305 | -0.009640 | 0.202778 |
| Arthritis | 0.011273 | 0.151135 | -0.124046 | 0.155523 | 0.137625 | 0.129958 | 0.120713 | 0.134115 | 1.000000 | -0.104465 | 0.375686 | -0.103089 | 0.065784 | 0.139030 | 0.124312 | -0.049986 | 0.002429 | -0.007205 | -0.059778 | 0.375686 |
| Sex | -0.015768 | -0.099826 | 0.063906 | 0.071980 | 0.007274 | -0.042369 | -0.141150 | -0.011600 | -0.104465 | 1.000000 | -0.065572 | 0.705696 | 0.384814 | 0.015511 | 0.068577 | 0.116365 | -0.092029 | -0.072312 | 0.157993 | -0.065572 |
| Age_Category | 0.029396 | 0.225873 | -0.128877 | 0.233017 | 0.270318 | 0.235955 | -0.100843 | 0.202778 | 0.375686 | -0.065572 | 1.000000 | -0.125048 | -0.045051 | 0.018961 | 0.133802 | -0.044179 | 0.053164 | 0.077506 | -0.172412 | 1.000000 |
| Height_(cm) | 0.000088 | -0.089177 | 0.097187 | 0.015581 | 0.005132 | -0.044302 | -0.093349 | -0.042289 | -0.103089 | 0.705696 | -0.125048 | 1.000000 | 0.509410 | -0.018590 | 0.047971 | 0.128767 | -0.044161 | -0.023955 | 0.135174 | -0.125048 |
| Weight_(kg) | 0.023168 | 0.011150 | -0.067275 | 0.048619 | -0.020923 | -0.019371 | 0.028763 | 0.147966 | 0.065784 | 0.384814 | -0.045051 | 0.509410 | 1.000000 | 0.844531 | 0.052068 | -0.013201 | -0.088073 | -0.068928 | 0.118319 | -0.045051 |
| BMI | 0.024794 | 0.065015 | -0.140239 | 0.046389 | -0.028584 | 0.005226 | 0.095232 | 0.199774 | 0.139030 | 0.015511 | 0.018961 | -0.018590 | 0.844531 | 1.000000 | 0.031099 | -0.094286 | -0.075401 | -0.067751 | 0.055005 | 0.018961 |
| Smoking_History | 0.001689 | -0.006792 | -0.094206 | 0.109822 | 0.031116 | 0.053710 | 0.102184 | 0.058165 | 0.124312 | 0.068577 | 0.133802 | 0.047971 | 0.052068 | 0.031099 | 1.000000 | 0.060649 | -0.092310 | -0.028583 | 0.042474 | 0.133802 |
| Alcohol_Consumption | 0.027691 | -0.045097 | 0.118789 | -0.051777 | 0.019914 | -0.023363 | -0.024138 | -0.116838 | -0.049986 | 0.116365 | -0.044179 | 0.128767 | -0.013201 | -0.094286 | 0.060649 | 1.000000 | 0.004865 | 0.087926 | 0.025783 | -0.044179 |
| Fruit_Consumption | -0.004715 | 0.040962 | 0.137490 | -0.019420 | 0.028005 | 0.010677 | -0.040335 | -0.017819 | 0.002429 | -0.092029 | 0.053164 | -0.044161 | -0.088073 | -0.075401 | -0.092310 | 0.004865 | 1.000000 | 0.249732 | -0.099847 | 0.053164 |
| Green_Vegetables_Consumption | -0.009803 | 0.049745 | 0.146838 | -0.020643 | 0.030356 | 0.004640 | -0.062374 | -0.033305 | -0.007205 | -0.072312 | 0.077506 | -0.023955 | -0.068928 | -0.067751 | -0.028583 | 0.087926 | 0.249732 | 1.000000 | -0.068829 | 0.077506 |
| FriedPotato_Consumption | 0.002836 | -0.066167 | -0.033929 | -0.012972 | -0.042705 | -0.040664 | 0.019190 | -0.009640 | -0.059778 | 0.157993 | -0.172412 | 0.135174 | 0.118319 | 0.055005 | 0.042474 | 0.025783 | -0.099847 | -0.068829 | 1.000000 | -0.172412 |
| Age | 0.029396 | 0.225873 | -0.128877 | 0.233017 | 0.270318 | 0.235955 | -0.100843 | 0.202778 | 0.375686 | -0.065572 | 1.000000 | -0.125048 | -0.045051 | 0.018961 | 0.133802 | -0.044179 | 0.053164 | 0.077506 | -0.172412 | 1.000000 |
To uncover the many relationships within the dataset, a correlation heatmap supports a more complete exploration. Using the seaborn library, the correlation matrix is visualized to show how the columns interact: a heatmap conveys the direction and strength of each pairwise relationship. The figure, produced with sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm"), is an effective tool for spotting trends, identifying dependencies, and pointing out directions for further research, offering a visually intuitive picture of the connections among the health-related variables.
correlation_matrix = df.corr()
plt.figure(figsize=(16, 14))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()
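Beyond the heatmap, correlations with the target can be ranked programmatically. A minimal sketch on a hypothetical miniature frame (in the notebook, `demo` would be replaced by df):

```python
import pandas as pd

# Hypothetical miniature frame with a target and two candidate predictors
demo = pd.DataFrame({
    'Heart_Disease': [0, 0, 1, 1, 0, 1],
    'Age':           [1, 2, 5, 6, 2, 5],
    'Exercise':      [1, 1, 0, 0, 1, 1],
})
target_corr = (
    demo.corr()['Heart_Disease']
    .drop('Heart_Disease')
    .sort_values(key=abs, ascending=False)  # strongest associations first
)
print(target_corr)
```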
The analysis now focuses on the members of the dataset diagnosed with heart disease, identified using the condition 'Heart_Disease == 1'. This subset is extracted with a data filter for further analysis. A histogram is then created with the Plotly Express module to show the distribution of Body Mass Index (BMI) within this cohort. The resulting plot, "BMI Distribution for People with Heart Disease", does an excellent job of illustrating the range of BMIs: "BMI" is shown on the x-axis, while the frequency of individuals is quantified on the y-axis. The 'bargap' option controls the spacing between the histogram bars to fine-tune the appearance. This visual representation offers an insightful look at the BMI distribution patterns among people diagnosed with heart disease.
Output:
As we can see, the largest group of people with heart disease (3,240 individuals) have BMI values between 28 and 29.999, which falls in the overweight category.
A BMI of 18.5 to 24.9 is considered normal, a BMI between 25.0 and 29.9 is considered overweight, a BMI of 30.0 or above is considered obese, and a BMI below 18.5 is considered underweight.
# Selecting only people with heart disease
heart_disease_df = df[df['Heart_Disease'] == 1]
fig = px.histogram(
heart_disease_df,
x='BMI',
nbins=30,
labels={'BMI': 'Body Mass Index'},
title="BMI Distribution for People with Heart Disease",
)
fig.update_layout(
xaxis_title="BMI",
yaxis_title="Count",
bargap=0.1, # Adjust the gap between bars
)
fig.show()
The provided code builds a histogram showing the link between the presence or absence of heart disease (Heart_Disease) and smoking history (Smoking_History) using Plotly Express. Each bar on the x-axis represents a smoking history group, color-coded to distinguish between those with and without heart disease. The resulting plot, "Smoking History vs. Heart Disease", displayed with fig.show(), summarizes the distribution of heart disease cases across smoking history categories using a binary encoding: 0 signifies no heart disease or no smoking history, while 1 indicates heart disease or a smoking history.
Output:
The plot suggests that people with heart disease are more likely to have smoked in the past, while most people without a smoking history in this dataset do not have heart disease.
import plotly.express as px
fig = px.histogram(
df,
x='Smoking_History',
color='Heart_Disease',
labels={'Smoking_History': 'Smoking History', 'Heart_Disease': 'Heart Disease'},
title='Smoking History vs. Heart Disease',
)
fig.update_layout(
xaxis_title='Smoking History',
yaxis_title='Count',
)
fig.show()
Data grouping (groupby()) is used to evaluate the average prevalence of skin cancer across age groups and genders. Grouping by the 'Age_Category' and 'Sex' variables, calculating the mean skin cancer prevalence, and resetting the index produces a new DataFrame named 'skin_cancer_prevalence'. A bar chart is then created with Seaborn, with age groups and gender on the x-axis and mean skin cancer prevalence on the y-axis. The chart is displayed at a figure size of 12 by 6 and titled "Skin Cancer Prevalence by Age Category and Gender", with corresponding labels on both axes.
Output:
The output chart highlights differences between male (denoted as 1, displayed in orange) and female (denoted as 0, shown in blue) populations by visually representing the occurrence of skin cancer across age groups. The chart is important because it shows gender differences in the occurrence of skin cancer across various age groups.
skin_cancer_prevalence = df.groupby(['Age_Category', 'Sex'])['Skin_Cancer'].mean().reset_index()
# Creating a bar chart to visualize skin cancer prevalence
plt.figure(figsize=(12, 6))
sns.barplot(data=skin_cancer_prevalence, x='Age_Category', y='Skin_Cancer', hue='Sex')
plt.title('Skin Cancer Prevalence by Age Category and Gender')
plt.xlabel('Age Category')
plt.ylabel('Prevalence Rate')
plt.show()
The approach to predicting heart disease combines machine learning and feature engineering. The 'Heart_Disease' column is first separated from the dataset as the target variable 'y'; the remaining columns form the feature matrix 'X'. The dataset is then divided into training and testing sets using train_test_split(X, y, test_size=0.2, random_state=42), which holds out 20% of the data for testing and fixes a random seed of 42 for reproducibility. A RandomForestClassifier with 100 decision trees is trained on the training set, and its feature_importances_ attribute is used to evaluate the significance of each feature. A SelectFromModel object with threshold='median' then performs feature selection, producing 'X_train_selected' and 'X_test_selected'. The sfm.get_support(indices=True) call provides the indices of the selected features, making it possible to extract and print their names. This methodology combines data preparation, model training, and feature selection to improve predictive accuracy for heart disease classification.
# Splitting the data into feature matrix (X) and target variable (y)
X = df.drop(columns=['Heart_Disease'])
y = df['Heart_Disease']
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Training a random forest classifier model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Performing feature selection
feature_importances = model.feature_importances_
# Creating a SelectFromModel object to select features based on a threshold
sfm = SelectFromModel(model, threshold='median')
sfm.fit(X_train, y_train)
X_train_selected = sfm.transform(X_train)
X_test_selected = sfm.transform(X_test)
# Now, X_train_selected and X_test_selected contain the selected features
# Printing the selected features
selected_feature_indices = sfm.get_support(indices=True)
selected_features = X.columns[selected_feature_indices]
print("Selected features:", selected_features)
Selected features: Index(['General_Health', 'Age_Category', 'Height_(cm)', 'Weight_(kg)', 'BMI',
'Alcohol_Consumption', 'Fruit_Consumption',
'Green_Vegetables_Consumption', 'FriedPotato_Consumption', 'Age'],
dtype='object')
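The threshold='median' rule used by SelectFromModel can be illustrated in isolation: features whose importance is at or above the median importance are kept. A sketch with hypothetical importances (not the values from the fitted model):

```python
import pandas as pd

# Hypothetical importances mirroring model.feature_importances_
importances = pd.Series(
    [0.05, 0.30, 0.65],
    index=['Exercise', 'Age', 'BMI'],
)
median = importances.median()                          # 0.30
kept = importances[importances >= median].index.tolist()
print(kept)  # ['Age', 'BMI']
```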
Using 100 decision trees, the Random Forest Classifier predicts the presence of heart disease with a noteworthy accuracy of 91.7% on the testing dataset. The assessment includes a classification report and a confusion matrix to give a thorough picture of the model's behavior. According to the confusion matrix, out of 55,218 test instances there were 50,498 true negatives (correctly predicted non-heart-disease cases), 189 false positives (misclassified as heart disease when the condition was absent), 4,397 false negatives (misclassified as non-heart-disease when the condition was present), and 134 true positives (correctly predicted heart disease patients).
The classification report details precision, recall, and F1-score for both classes. Notably, precision is 92% for the absence of heart disease but only 41% for its presence. The near-perfect (100%) recall for the absence of heart disease highlights the model's accuracy in detecting negative cases. However, recall for heart disease is only 3%, revealing a significant difficulty in correctly identifying positive cases. The F1-scores show the same gap: 6% for heart disease versus 96% for non-heart disease. Despite the overall accuracy of 92%, further work is clearly required, especially to reduce false negatives and improve the model's sensitivity to heart disease cases.
# 1: Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
rf_y_pred = rf_model.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_y_pred)
rf_conf_matrix = confusion_matrix(y_test, rf_y_pred)
rf_class_report = classification_report(y_test, rf_y_pred)
# Printing the evaluation results
print("Random Forest Classifier:")
print("Accuracy:", rf_accuracy)
print("Confusion Matrix:\n", rf_conf_matrix)
print("Classification Report:\n", rf_class_report)
Random Forest Classifier:
Accuracy: 0.9169473722336919
Confusion Matrix:
 [[50498   189]
 [ 4397   134]]
Classification Report:
               precision    recall  f1-score   support

           0       0.92      1.00      0.96     50687
           1       0.41      0.03      0.06      4531

    accuracy                           0.92     55218
   macro avg       0.67      0.51      0.51     55218
weighted avg       0.88      0.92      0.88     55218
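One standard remedy for the low class-1 recall seen above is to re-weight the classes during training. The sketch below is not part of the original analysis: it uses a synthetic imbalanced dataset (the BRFSS features are not reproduced here) to show how `class_weight="balanced"` would be passed to the same `RandomForestClassifier`:

```python
# Hypothetical sketch: class re-weighting on a synthetic imbalanced dataset
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly the same ~92/8 class split as the BRFSS target
X, y = make_classification(n_samples=5000, weights=[0.92, 0.08], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Default forest vs. one whose sample weights are inversely proportional
# to class frequency
plain = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
weighted = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                                  random_state=42).fit(X_tr, y_tr)

print("recall (default): ", recall_score(y_te, plain.predict(X_te)))
print("recall (balanced):", recall_score(y_te, weighted.predict(X_te)))
```

Other options in the same spirit include resampling the training data (e.g. SMOTE) or lowering the decision threshold via `predict_proba`; any of these trades some majority-class precision for minority-class recall.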
Next, the Support Vector Machine (SVM) model is trained and then used for prediction on the testing dataset. After fitting on the training data, its predictions are compared against the true target values. The assessment again consists of computing accuracy, building a confusion matrix, and analyzing per-class performance through a classification report. Together these outputs indicate how effectively the SVM separates the two classes and support informed decisions about further refinement or alternative models.
Output:
The accuracy of the SVM model is 91.8%. The confusion matrix, however, reveals a severe imbalance: the model predicts every instance as class 0, so it covers all cases without heart disease but fails to identify a single case of heart disease (class 1). Consequently, precision, recall, and F1-score for class 1 are all zero. The weighted average accuracy of 92% is driven entirely by the model's performance on class 0, while the macro-average metrics, hovering around 0.5, expose the impact of the class imbalance. These results call for a careful assessment of possible improvements or alternative models to address the model's flaws.
# 2: Support Vector Machine (SVM)
# Fit an SVM with the default RBF kernel on the training data
svm_model = SVC()
svm_model.fit(X_train, y_train)

# Predict on the test set and compute the evaluation metrics
svm_y_pred = svm_model.predict(X_test)
svm_accuracy = accuracy_score(y_test, svm_y_pred)
svm_conf_matrix = confusion_matrix(y_test, svm_y_pred)
svm_class_report = classification_report(y_test, svm_y_pred)
print("Support Vector Machine (SVM):")
print("Accuracy:", svm_accuracy)
print("Confusion Matrix:\n", svm_conf_matrix)
print("Classification Report:\n", svm_class_report)
Support Vector Machine (SVM):
Accuracy: 0.917943424245717
Confusion Matrix:
 [[50687     0]
 [ 4531     0]]
Classification Report:
               precision    recall  f1-score   support

           0       0.92      1.00      0.96     50687
           1       0.00      0.00      0.00      4531

    accuracy                           0.92     55218
   macro avg       0.46      0.50      0.48     55218
weighted avg       0.84      0.92      0.88     55218
/Users/ishratshaikh/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1318: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
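Predicting only the majority class is a common symptom of running an RBF-kernel SVM on unscaled features with imbalanced labels, and the `UndefinedMetricWarning` follows directly from the absence of predicted positives. The sketch below (hypothetical, on synthetic data rather than the BRFSS features) shows the usual fixes: standardize inputs inside a pipeline, re-weight the classes, and silence the warning with `zero_division`:

```python
# Hypothetical sketch: scaled, class-weighted SVM on synthetic imbalanced data
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# StandardScaler keeps the RBF kernel from being dominated by large-range
# features; class_weight="balanced" penalizes minority-class errors more
svm = make_pipeline(StandardScaler(), SVC(class_weight="balanced"))
svm.fit(X_tr, y_tr)

# zero_division=0 suppresses the UndefinedMetricWarning if a class is
# still never predicted
print(classification_report(y_te, svm.predict(X_te), zero_division=0))
```

Whether these changes recover useful class-1 recall on the actual BRFSS data would need to be verified empirically.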
This code trains and evaluates a Logistic Regression model. After fitting on the training dataset, the model's predictive accuracy is measured on held-out data. The evaluation examines the overall accuracy and the confusion matrix, which tallies correct and incorrect predictions, and a classification report summarizes performance across the classes. The precision, recall, and F1-score metrics in this report indicate how accurately and efficiently the Logistic Regression model predicts heart disease cases.
Output:
Logistic Regression achieves 91.7% accuracy on the dataset. Like the previous models, it identifies non-heart-disease cases (class 0) well but struggles with heart-disease cases (class 1), producing many false negatives. The classification report shows 41% precision and only 2% recall for heart disease, underscoring the model's difficulty in detecting positive cases. The weighted average accuracy of 92% again reflects performance on the majority class, while the macro-average metrics expose the effect of class imbalance and point to the need for further analysis and possible improvements.
# 3: Logistic Regression
# Note: the default solver (lbfgs, max_iter=100) does not converge on this
# data; see the ConvergenceWarning below
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)

# Predict on the test set and compute the evaluation metrics
lr_y_pred = lr_model.predict(X_test)
lr_accuracy = accuracy_score(y_test, lr_y_pred)
lr_conf_matrix = confusion_matrix(y_test, lr_y_pred)
lr_class_report = classification_report(y_test, lr_y_pred)
print("Logistic Regression:")
print("Accuracy:", lr_accuracy)
print("Confusion Matrix:\n", lr_conf_matrix)
print("Classification Report:\n", lr_class_report)
Logistic Regression:
Accuracy: 0.9173095729653374
Confusion Matrix:
 [[50569   118]
 [ 4448    83]]
Classification Report:
               precision    recall  f1-score   support

           0       0.92      1.00      0.96     50687
           1       0.41      0.02      0.04      4531

    accuracy                           0.92     55218
   macro avg       0.67      0.51      0.50     55218
weighted avg       0.88      0.92      0.88     55218
/Users/ishratshaikh/opt/anaconda3/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning:
lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
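The warning means the lbfgs solver hit its default iteration cap (`max_iter=100`) before converging, so the reported coefficients may be suboptimal. The two remedies the warning itself suggests, standardizing the inputs and raising `max_iter`, can be sketched as follows (on synthetic data, since the BRFSS features are not reproduced here):

```python
# Hypothetical sketch: resolving the lbfgs ConvergenceWarning
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, random_state=0)

# Scaling puts all features on a comparable range, which conditions the
# optimization problem far better; a larger max_iter is a safety margin
lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
lr.fit(X, y)

# n_iter_ records how many iterations lbfgs actually needed
print("converged after", lr.named_steps["logisticregression"].n_iter_[0],
      "iterations")
```

On well-scaled data, lbfgs typically converges in far fewer than 100 iterations, which is why scaling alone often silences the warning.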
To sum up, the research offers insightful information about potential risk factors for heart disease. The Random Forest Classifier produced an accuracy of about 91.7% on the test set, although its very low recall for positive cases limits its practical value. Additional feature research and optimization may improve the model's performance, but it can still serve as a baseline for predictive purposes.
Targeted interventions and individualized healthcare plans are made possible by an understanding of important correlations, risk factors, and lifestyle characteristics. To make wise decisions, it is essential to evaluate model outputs in conjunction with domain knowledge and medical experience. The visualizations that are shown help to explain the complex linkages and patterns found in the dataset. Improvement of features, ongoing observation, and cooperation between data scientists and medical practitioners can lead to more precise forecasts and improved health results.
As with any predictive model, it is critical to recognize the limitations and possible flaws of the dataset. A robust and dependable predictive analytics system requires frequent updates, diverse datasets, and continuous validation.